Make graphs under R with ggplot2
Purpose
At the end of this session, you will be able to:
- Create different kinds of graphs
- Set titles and legends
- Change colors
- Combine plots
- Save plots
Data and graphs
The 80-20 rule: Graphics
Analysis graphs:
Happily, 20% of effort can give 80% of a desired result (default
settings for plots often give something reasonable).
Presentation graphs:
Sadly, 80% of total effort may be required to give the remaining 20% of
your final graph:
- Graph title, axis and value labels
- Color, shape, size of point symbols, line style, width of lines…
- Legends to connect the data in the graph to interpretation
- …customization is almost infinite
Data
To draw a graph, we need data in a table format. A data table, or data.frame, has several rows (also called observations) and columns (also called variables). The columns of a data table are named and can be of different types. Within a column, all values must be of the same type.
For the rest of the course, we will use the
Clinical_Cohort.csv table which you can find there.
To load a table, we can use the read.table()
function from the R package named utils:
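The loading call itself is not shown here, so the following is only a minimal sketch. The comma separator and the header row are assumptions about the file, and a three-row stand-in (built from rows displayed further down) replaces the real Clinical_Cohort.csv so that the snippet runs on its own.

```r
# Stand-in for Clinical_Cohort.csv (assumption: comma-separated, with a header row)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Subject_number,Sex,Age,OS",
             "P0021,M,87.98,0.06666667",
             "P0061,F,73.35,0.20000000",
             "P0129,M,87.26,0.46666667"),
           csv_path)

# read.table() comes from the utils package, which R loads by default
clinical_data <- utils::read.table(csv_path, header = TRUE, sep = ",",
                                   stringsAsFactors = FALSE)
```

With the real file, replace csv_path by the path to Clinical_Cohort.csv. read.csv() is a shortcut for read.table() with header = TRUE and sep = ",".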
Whenever you load a table into R for the first time, take a look at its contents. R provides a whole set of very useful functions for that.
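The outputs that follow have the format produced by head(), dim(), str() and summary(), in that order. This is an inference from their layout, since the calls are not displayed. A small sketch on a stand-in table (a few columns of the first six rows) would be:

```r
# Stand-in table: three columns of the first six rows of the cohort
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

head(clinical_data)     # first rows (6 by default)
dim(clinical_data)      # number of rows, then number of columns
str(clinical_data)      # type and first values of each column
summary(clinical_data)  # summary statistics for each column
```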
## Subject_number Sex Age Tabaco Diabetes Hypertension NSCLC_type
## 1 P0021 M 87.98 former no yes adenocarcinoma
## 2 P0061 F 73.35 former yes yes squamous cell carcinoma
## 3 P0129 M 87.26 former no no adenocarcinoma
## 4 P0097 M 85.05 former no yes squamous cell carcinoma
## 5 P0100 M 84.95 former no no adenocarcinoma
## 6 P0050 M 83.67 former no yes squamous cell carcinoma
## Initial_stage Histology MTS_other
## 1 IV Non small cell lung cancer (NSCLC) node (adrenal)
## 2 IV Non small cell lung cancer (NSCLC) node, (liver)
## 3 IV Non small cell lung cancer (NSCLC) cancer bilateral
## 4 IV Non small cell lung cancer (NSCLC) node (liver)
## 5 IV Other node
## 6 II Non small cell lung cancer (NSCLC) nodal, pleura
## Date_of_Diagnosis Previous_immunotherapy Previous_radiotherapy
## 1 12/10/2019 yes no
## 2 22/10/2018 yes yes
## 3 01/11/2017 no no
## 4 13/08/2019 yes yes
## 5 29/01/2021 no no
## 6 26/09/2018 no yes
## Total.treatment.lines Immunotherapy_name Progression LFU_status
## 1 1 PEMBROLIZUMAB no Partial remission
## 2 2 PEMBROLIZUMAB yes Disease progression
## 3 3 ATEZOLIZUMAB yes death
## 4 2 PEMBROLIZUMAB yes Partial remission
## 5 1 PEMBROLIZUMAB yes Progression disease
## 6 1 PEMBROLIZUMAB yes Disease progression
## AE_1 Mutation_Type Last_contact_date Death_or_alive OS Comments
## 1 <NA> KRAS 26/01/2022 alive 0.06666667
## 2 <NA> KIT 22/08/2020 dead 0.20000000
## 3 asthenia KRAS 14/04/2022 dead 0.46666667
## 4 Anemia ALK 23/06/2021 alive 0.63333333
## 5 weightloss KRAS 02/07/2021 dead 0.80000000
## 6 <NA> <NA> 03/07/2021 dead 0.83333333
## [1] 122 23
## 'data.frame': 122 obs. of 23 variables:
## $ Subject_number : chr "P0021" "P0061" "P0129" "P0097" ...
## $ Sex : chr "M" "F" "M" "M" ...
## $ Age : num 88 73.3 87.3 85 85 ...
## $ Tabaco : chr "former" "former" "former" "former" ...
## $ Diabetes : chr "no" "yes" "no" "no" ...
## $ Hypertension : chr "yes" "yes" "no" "yes" ...
## $ NSCLC_type : chr "adenocarcinoma" "squamous cell carcinoma" "adenocarcinoma" "squamous cell carcinoma" ...
## $ Initial_stage : chr "IV" "IV" "IV" "IV" ...
## $ Histology : chr "Non small cell lung cancer (NSCLC)" "Non small cell lung cancer (NSCLC)" "Non small cell lung cancer (NSCLC)" "Non small cell lung cancer (NSCLC)" ...
## $ MTS_other : chr "node (adrenal)" "node, (liver)" "cancer bilateral" "node (liver)" ...
## $ Date_of_Diagnosis : chr "12/10/2019" "22/10/2018" "01/11/2017" "13/08/2019" ...
## $ Previous_immunotherapy: chr "yes" "yes" "no" "yes" ...
## $ Previous_radiotherapy : chr "no" "yes" "no" "yes" ...
## $ Total.treatment.lines : int 1 2 3 2 1 1 3 1 1 6 ...
## $ Immunotherapy_name : chr "PEMBROLIZUMAB" "PEMBROLIZUMAB" "ATEZOLIZUMAB" "PEMBROLIZUMAB" ...
## $ Progression : chr "no" "yes" "yes" "yes" ...
## $ LFU_status : chr "Partial remission" "Disease progression" "death" "Partial remission" ...
## $ AE_1 : chr NA NA "asthenia " "Anemia" ...
## $ Mutation_Type : chr "KRAS" "KIT" "KRAS" "ALK" ...
## $ Last_contact_date : chr "26/01/2022" "22/08/2020" "14/04/2022" "23/06/2021" ...
## $ Death_or_alive : chr "alive" "dead" "dead" "alive" ...
## $ OS : num 0.0667 0.2 0.4667 0.6333 0.8 ...
## $ Comments : chr "" "" "" "" ...
## Subject_number Sex Age Tabaco
## Length:122 Length:122 Min. :31.72 Length:122
## Class :character Class :character 1st Qu.:58.21 Class :character
## Mode :character Mode :character Median :70.36 Mode :character
## Mean :67.06
## 3rd Qu.:75.10
## Max. :87.98
## Diabetes Hypertension NSCLC_type Initial_stage
## Length:122 Length:122 Length:122 Length:122
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Histology MTS_other Date_of_Diagnosis
## Length:122 Length:122 Length:122
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Previous_immunotherapy Previous_radiotherapy Total.treatment.lines
## Length:122 Length:122 Min. : 0.000
## Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Median : 2.000
## Mean : 2.344
## 3rd Qu.: 3.000
## Max. :11.000
## Immunotherapy_name Progression LFU_status AE_1
## Length:122 Length:122 Length:122 Length:122
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Mutation_Type Last_contact_date Death_or_alive OS
## Length:122 Length:122 Length:122 Min. : 0.06667
## Class :character Class :character Class :character 1st Qu.: 2.35000
## Mode :character Mode :character Mode :character Median : 8.90000
## Mean :13.08224
## 3rd Qu.:21.48333
## Max. :45.90000
## Comments
## Length:122
## Class :character
## Mode :character
##
##
##
Graphs definition
A graph is a visual representation of the variation of one variable
(drawn as a series of points, lines, line segments, etc.)
relative to the variation of one or more other variables.
With 2 variables, the graph is drawn in 2 dimensions, along 2 axes: a
horizontal axis named x and a vertical axis named
y.
R Graphics from R base
R has many innate graphics capabilities that come with it. These are
called base graphics
since they are technically included in the base
package, which comes with R and is automatically loaded when you open
it.
A good example of base graphics
is the plot()
function, which (you guessed it) can make some basic plots. For example,
you can make a scatterplot of two vectors (“Age” and “OS”) with the plot()
function:
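A minimal sketch of that call, using the first six rows of the table as a stand-in for the full dataset:

```r
# Stand-in table: Age and OS of the first six patients
clinical_data <- data.frame(
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

# Scatterplot of OS against Age with base graphics
plot(x = clinical_data$Age, y = clinical_data$OS)
```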
There are tons of options you can give to the
plot()
function, see ?plot for a non-exhaustive, and yet still
exhausting, listing of options that I won’t talk about here.
But to give a small sampling:
graphics::plot(x = clinical_data$Age, y = clinical_data$OS,
type = "l",
col = "darkblue",
lty = 5,
lwd = 1.5,
main = "Correspondence between the patients' Age and their Overall Survival",
sub = "A graphical test",
xlab = "Age",
ylab = "Overall Survival")
Other base graphics functions include hist() to make histograms, boxplot() to make box plots, barplot() to make bar plots, pie() to make pie charts…
There are also lines() to add a line to the graph, legend() to add a legend…
More information here: http://www.sthda.com/english/wiki/r-base-graphs
R Graphics from ggplot2 R package
There are packages other than base
that create graphics.
The two most popular are lattice
and ggplot2.
To install them you simply use, for example,
install.packages("ggplot2").
In this course we’ll use ggplot2
exclusively.
There are many reasons why we will prefer ggplot2
to base graphics,
but the most important are the following:
- It’s easier to learn, since the names of the plotting functions are more systematic.
- It’s much more widely used.
- It has a much better documentation system, see http://docs.ggplot2.org/current/.
- It is also much more extensible: it’s easier to add your own graphics.
The ggplot2 syntax
First of all, load the package with library(ggplot2).
ggplot2 works with data in data.frame format.
A plot has 3 main components:
- a dataset
- a set of aesthetics (what we want to represent in the plot: columns of the dataset)
- a set of layers (mainly geometries: graphical aspects)
As a first plot, we represent the overall survival (“OS”) of patients
according to their “Age” (the same plot as the one drawn with R base graphics
before):
ggplot(clinical_data, ## data
aes(x = Age, y = OS) ## aesthetics
) +
geom_point() ## a layer of geometry
The ggplot()
function initiates the plot. Here we use the clinical_data data.frame
with “Age” and “OS” columns as aesthetics (aes()
function) and one layer of geometry (geom_point()
function) to draw points.
We add components (such as layers) to the plot with +.
We can add as many layers as we want.
For example, if we want to add a line to our first plot, we add a layer
with geom_line().
The ggplot()
function returns a ggplot object which can be stored as a variable to be
used later or be built step by step.
To display the plot we have to print the object, either by directly
calling it in the console or applying the print()
function.
If several geometries are defined, they will be plotted in order, so later layers may hide earlier ones.
Remember: we can add arguments to a geometry function to customize it. For example,
linewidth changes the width of a line, and color changes the color of the points.
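The points above can be put together in one short sketch. The stand-in table (first six rows of the cohort), the object name p and the chosen width and color are illustrative choices, not values from the course.

```r
library(ggplot2)

# Stand-in table: Age and OS of the first six patients
clinical_data <- data.frame(
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

# Store the plot in a variable and build it step by step
p <- ggplot(clinical_data, aes(x = Age, y = OS))
p <- p + geom_line(linewidth = 1.5)   # first layer: a thicker line
p <- p + geom_point(color = "red")    # second layer: red points drawn on top of the line

print(p)  # typing p in the console has the same effect
```

Swapping the two layers would draw the line over the points instead.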
Geometries: different kinds of plots (layers)
Several types of geometries are defined in ggplot2
and can be classified according to the type and number of variables used
in the plot (restricted list below).
- one variable
- continuous
- discrete
- two variables
- continuous X, continuous Y
- discrete X, continuous Y
- bivariate distribution
- continuous function
- three variables
Definition:
A discrete variable only allows a particular set of values, and in-between values are not included (like Sex, Pathology etc). A continuous variable can take any value in a range (height, weight, etc), including values with decimals. The Age can be considered as both: it is a range from 0 to 120 (yes, I’m an optimist!) so it is a continuous variable, but if we don’t consider fractions of a year, it could be discrete (0, 1, 2, 3, 4, … 120).
Note that layers are functions that can contain arguments allowing us to customize our graphs.
Histogram / Density
Histograms and density plots represent distributions (how many
observations fall at each value or range of values).
For these geometries, only the x aesthetic is
mandatory.
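A minimal histogram sketch, with only the x aesthetic mapped. The stand-in table holds the ages of the first six patients, and the name p_hist is illustrative.

```r
library(ggplot2)

# Stand-in table: Age of the first six patients
clinical_data <- data.frame(Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67))

# Only x is mapped; the counts on the y axis are computed by geom_histogram()
p_hist <- ggplot(clinical_data, aes(x = Age)) + geom_histogram()
p_hist
```

Since neither bins nor binwidth is given, ggplot2 falls back to 30 bins and prints a message about it.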
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can adjust the number of bins or their width with
the bins and binwidth arguments. We can also
manually set the breaks.
ggplot(clinical_data, aes(x = Age)) + geom_histogram(binwidth = 10) # the width of each group is 10 (here 10 years for the Age)
Exercise:
Draw a density graph on the “Age” of the clinical_data dataset.
Tip
The result should look like:
Bar graph (geom_bar versus geom_col)
There are two types of bar charts: geom_bar()
makes the height of the bar proportional to the number of cases in each
group (similar to histograms but for discrete variables). If you want
the heights of the bars to represent values in the data, use geom_col()
instead.
geom_bar()
geom_bar()
counts the number of female and male patients in the dataset.
geom_col()
geom_col()
represents the values in the data. To draw this plot, I selected the first 2
patients of the dataset, one male patient aged 87.98 and one
female patient aged 73.35.
## Sex Age
## 1 M 87.98
## 2 F 73.35
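The two bar geometries can be compared side by side in a short sketch. The stand-in table (first six rows) and the object names are illustrative; selecting the first two patients with head() is one possible way to obtain the small table shown above.

```r
library(ggplot2)

# Stand-in table: Sex and Age of the first six patients
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67)
)

# geom_bar(): bar height = number of rows in each Sex category
p_bar <- ggplot(clinical_data, aes(x = Sex)) + geom_bar()

# geom_col(): bar height = the y value found in the data
two_patients <- head(clinical_data[, c("Sex", "Age")], 2)
p_col <- ggplot(two_patients, aes(x = Sex, y = Age)) + geom_col()
```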
Question:
What does the y value correspond to if I give my entire dataset to geom_col()?
(The values “F” and “M” appear several times, so we have several
“Age” values for each “Sex” category)
Answer
It makes the sum of all the ages for each category of Sex! Not very
interesting…
To check that, we can compute these sums:
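One way to compute these sums in base R is sketched below on a stand-in table (first six rows), so the numbers it yields differ from those of the full cohort. The name age_sums is illustrative.

```r
# Stand-in table: Sex and Age of the first six patients
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67)
)

# Sum of the ages for each Sex category, one category at a time
sum(clinical_data$Age[clinical_data$Sex == "F"])
sum(clinical_data$Age[clinical_data$Sex == "M"])

# Same computation for all categories at once
age_sums <- tapply(clinical_data$Age, clinical_data$Sex, sum)
```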
## [1] 3243.83
## [1] 4937.95
Be careful when you make plots: sometimes R does not return an error, yet
the result may not be what you wanted!
Boxplot / Violinplot / Jitter
Boxplots and violin plots are graphs summarising a set of data, where
the shape shows how the data is distributed.
Boxplots show the quartiles of the data (25% of the data lie below the
first quartile; 50% lie below the second quartile, etc).
A rectangle is drawn from the first to the third quartile,
usually with a horizontal line inside to indicate the median value
(i.e. 50%). The lower and upper quarters of the data are shown as vertical lines (whiskers)
below and above the rectangle.
For violin plots, the width of the shape shows the density of observations at each
value.
Boxplots are drawn by geom_boxplot();
violin plots are drawn with geom_violin().
geom_jitter() draws each value directly as a point, with a small random shift to limit overlap.
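Each of the three geometries can be tried separately, as in the sketch below. The stand-in table (first six rows) contains a single female patient, which is too few to draw a violin for that group, so the full table is needed to get meaningful shapes. Object names and the jitter width are illustrative.

```r
library(ggplot2)

# Stand-in table: Sex and Age of the first six patients
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67)
)

base_plot <- ggplot(clinical_data, aes(x = Sex, y = Age))

p_box    <- base_plot + geom_boxplot()   # quartiles of Age for each Sex
p_violin <- base_plot + geom_violin()    # density of Age for each Sex
# height = 0 keeps the Age values exact; only the x position is shifted
p_jitter <- base_plot + geom_jitter(width = 0.1, height = 0)
```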
Exercise:
Draw a violin plot of the “Age” according to the “Sex” of the clinical_data dataset,
showing the individual values too (shape and points).
Scatterplots
Scatterplots are drawn by the geom_point()
layer. That is the first plot that we saw at the beginning of this
course.
More complex plot: aes()
Aesthetics basics
Aesthetics are the set of visual properties of the plot mapped to variables or set to fixed values.
We can define several aesthetics for a plot such as:
x, y, z, color, fill, size, text, label, shape, linetype, alpha, group
Each type of geometry uses some of the above aesthetics.
The aesthetics defined inside the aes()
function refer to columns of the data.frame used, and allow us to
add a third piece of information to the graph.
Color and fill
color and fill aesthetics can be mapped to
continuous or discrete variables, resulting in (respectively) a gradient
or a palette of colors with default values (blue gradient or rainbow
colors). Depending on the geometry used, we can specify
color, fill or both.
Scatterplot
So, if we want to color the points of the scatterplot according to
another column, we add the color aesthetic in aes()
(a gradient or a palette of colors is automatically chosen):
# palette scale: here 2 colors for Sex
ggplot(clinical_data, aes(x = Age, y = OS, color = Sex)) + geom_point()
If we want to set a fixed color for the data, it must be defined outside
aes().
For example, if we want to color all the points in blue, we set the
color argument outside the aes()
function.
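A minimal sketch of a fixed color, on a stand-in table (first six rows). The name p_blue is illustrative.

```r
library(ggplot2)

# Stand-in table: Age and OS of the first six patients
clinical_data <- data.frame(
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

# color is an argument of geom_point(), not a mapping: every point is blue
# and no legend is created
p_blue <- ggplot(clinical_data, aes(x = Age, y = OS)) +
  geom_point(color = "blue")
p_blue
```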
Question:
What happens if I set color = "blue" inside the aes()
function?
Answer
If we try to define the color as “blue” inside the aes(),
ggplot()
will map the color to a variable which only contains the word “blue” and
assign a default color to it.
Histogram and boxplot
Now, if we want to color the bars of the histogram according to
another column, we add the fill aesthetic in aes()
(a palette of colors is automatically chosen).
Exercise:
Draw the histogram of the “Age” of the clinical_data dataset again, but
this time colored according to the “Sex” with the fill
aesthetic.
Answer
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, if we want to color the outlines of the bars of the histogram
according to another column, we add the color aesthetic in
aes()
(a palette of colors is automatically chosen).
Exercise:
Draw the histogram of the “Age” of the clinical_data dataset again, but
this time colored according to the “Sex” with the color
aesthetic.
Answer
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Exercise:
Draw the previous boxplot of the “Age” according to the “Sex” again, this time also
according to the “Death_or_alive” status, with the
fill aesthetic.
As with histograms, color can be used to color the
outlines of the boxplot instead of the whole box.
We can combine fill and color, but I think
that color is pretty difficult to read for histograms and
boxplots.
ggplot(clinical_data, aes(x = Sex, y = Age, fill = Death_or_alive, color = Hypertension)) + geom_boxplot()
Point shapes
The shape of points is controlled by the shape
aesthetic.
If we want to change the shape of the points of the scatterplot
according to another column, we add the shape aesthetic in
aes().
A palette of shapes is automatically chosen. 25 different shapes are
available, defined by a number. The hollow shapes (0–14) have a border
determined by color; the solid shapes (15–18) are filled with color; the
filled shapes (21–24) have a border of color and are filled with fill.
For example, we can plot the shape of “Death_or_alive” column on the scatterplot:
To change the shape, use the scale_shape_manual()
function, like:
ggplot(clinical_data, aes(x = Age, y = OS, shape = Death_or_alive)) + geom_point() + scale_shape_manual(values=c(3, 15)) # we have to set 2 values (3 and 15) because Death_or_alive has 2 different values
Question: Can we combine color (or
fill) and shape inside aes()?
Tip
Here is a graph:
color and shape inside aes(),
with “Sex” and “Death_or_alive” for example.
Answer
Of course we can combine them!
Line types
Line types are specified by a name or a number (from 1 to 6).
For example, we can use a different line type for each “Sex” value on the line plot:
To change the linetype, use the scale_linetype_manual()
function, like:
ggplot(clinical_data, aes(x = Age, y = OS, linetype = Sex)) + geom_line() + scale_linetype_manual(values=c("twodash", "dotted")) # we have to set 2 values ("twodash" and "dotted") because Sex has 2 different values
Computing on aesthetics
ggplot2
is able to directly compute new variables from existing ones inside aes().
For example, to plot the ending age of patients (Age + OS) according to their
“Age” (at diagnosis), we can do:
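A minimal sketch of such a computed aesthetic, on a stand-in table (first six rows). The sum is written directly inside aes(), so no new column is needed. The name p_end is illustrative.

```r
library(ggplot2)

# Stand-in table: Age and OS of the first six patients
clinical_data <- data.frame(
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

# y is computed on the fly from two columns of the data
p_end <- ggplot(clinical_data, aes(x = Age, y = Age + OS)) + geom_point()
p_end
```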
Customizing plots (layers and aes)
Title, subtitle, caption and axis names
We can control the plot title, subtitle, caption and axes labels with
the labs()
function.
ggplot(clinical_data, aes(x = Age, y = OS)) +
geom_point() +
labs(title = "This is a title", subtitle = "and this is a subtitle",
caption = "and a caption", tag = "Fig. 1", x = "x label", y = "y label")
Another way to code some of these features:
ggplot(clinical_data, aes(x = Age, y = OS)) +
geom_point() +
ggtitle("This is a \n new title", subtitle = "and this is a \n new subtitle") +
xlab("x label") +
ylab("y label")
Note: to set a title on 2 lines, add the line separator
\n in your text.
Axis / Scales
Scales transformations
A series of transformations (log, square root) is implemented for
scales, see scale_x_log10(),
scale_y_log10(),
scale_x_sqrt(),
scale_y_sqrt()
and related functions.
We can reverse a scale with scale_x_reverse()
or scale_y_reverse().
Discrete scales
Discrete scales are handled by scale_<axis>_discrete().
We can change the order of levels, restrict them or add some new
ones.
# change order of levels
ggplot(clinical_data, aes(x = Sex)) + geom_bar() + scale_x_discrete(limits = c("F", "M"))
# restrict levels: plot only female
ggplot(clinical_data, aes(x = Sex)) + geom_bar() + scale_x_discrete(limits = c("F"))
## Warning: Removed 71 rows containing non-finite outside the scale range
## (`stat_count()`).
# add new level
ggplot(clinical_data, aes(x = Sex)) + geom_bar() + scale_x_discrete(limits = c("F", "M", "Other"))
Continuous scales
Continuous scales are handled by scale_<axis>_continuous().
We can change the number of values on the axis:
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + scale_y_continuous(breaks=seq(0, 50, by=5))
We can add a second axis, thanks to the sec.axis
argument of the scale_<axis>_continuous()
function. The second axis can be identical to the first one, or a
computation from the first one:
ggplot(clinical_data, aes(x = Age, y = OS)) +
geom_point() +
scale_y_continuous(name = "OS (in years)", #name of the first axis
sec.axis = sec_axis(transform=~.*12, name="OS (in months)")) #add a second axis
Exercise:
Redo the previous graph without transformation of the second axis (the
second axis has to be the same as the first one).
Tip
You have to keep the transform argument to give it the data
used to draw the axis.
Answer
ggplot(clinical_data, aes(x = Age, y = OS)) +
geom_point() +
scale_y_continuous(name = "OS (in years)",
sec.axis = sec_axis(transform=~.))
Zooms
There are two ways to zoom in/out of a plot depending on whether the
data are clipped or not:
* xlim() and ylim() perform a zoom in the data (data points outside the limits are removed)
* coord_cartesian() performs a zoom in the plot without removing data (preferred)
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + xlim(c(50, 80)) # be careful: removes data below 50 and above 80 to zoom
## Warning: Removed 26 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + coord_cartesian(xlim = c(50, 80)) # just zooms on the 50 to 80 range of the x axis
Colors
In R, colors can be specified either by name (e.g
col = "red") or as a hexadecimal RGB triplet (such as
col = "#FFCC00"). You can also use other color systems such
as ones taken from the RColorBrewer
package.
Manual colors
When we need to indicate a color to R, we can refer to certain colors by name, such as “red” or “blue”. The list of colors recognized by R is available at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
We can set the colors manually with scale_fill_manual()
and scale_color_manual(),
depending on whether the fill or the color
aesthetic is used in the ggplot()
function.
ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive)) +
geom_bar() +
scale_fill_manual(values = c("dead" = "cyan4", "alive" = "deeppink3")) # "dead" and "alive" are the values of the "Death_or_alive" column mapped to fill
In computing, colors are usually coded as Red/Green/Blue (see https://en.wikipedia.org/wiki/RGB_color_model) and
represented by a 6-character hexadecimal code, preceded by the symbol
#. This code is recognized by R, we can for example
indicate “#FF0000” for the color red. The hexadecimal code of the
different colors can be easily obtained on the internet, many sites
being devoted to color palettes.
Personally, I like this one: https://htmlcolorcodes.com/
ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive)) +
geom_bar() +
scale_fill_manual(values = c("dead" = "#da330f", "alive" = "#12c179")) # "dead" and "alive" are the values of the "Death_or_alive" column mapped to fill
Palettes
R natively provides some continuous color palettes that we can use by
their name, such as rainbow, heat.colors,
terrain.colors, topo.colors and
cm.colors. But the RColorBrewer
package is an essential tool to manage colors with R. It offers
several color palettes (see https://r-graph-gallery.com/38-rcolorbrewers-palettes.html).
If you use ggplot2,
the RColorBrewer
palettes are directly available via the scale_fill_brewer()
and scale_colour_brewer()
functions:
ggplot(clinical_data, aes(x = Sex, fill = NSCLC_type)) +
geom_bar() +
scale_fill_brewer(palette = "Dark2")
ggplot(clinical_data, aes(x = Sex, fill = NSCLC_type)) +
geom_bar() +
scale_fill_brewer(palette = "Paired")
> Note:
RColorBrewer
palettes are only implemented for discrete variables.
Gradients
Gradients are used to color continuous variables.
Several functions are used to set the gradient properties, depending on
the number of colors to set (2, 3 or n):
#2 colors
ggplot(clinical_data, aes(x = Age, y = OS, color = OS)) +
geom_point() +
scale_color_gradient(low = "green", high = "red")
#3 colors
ggplot(clinical_data, aes(x = Age, y = OS, color = OS)) +
geom_point() +
scale_color_gradient2(low = "green", mid = "blue", high = "red", midpoint = 20)
#n colors
ggplot(clinical_data, aes(x = Age, y = OS, color = OS)) +
geom_point() +
scale_color_gradientn(colors = c("blue", "red", "green", "yellow"))
We can get more control over the gradient with the breaks
and limits arguments.
Other palettes
Other palettes exist, in particular those of the Viridis family, whose
colors remain distinguishable under the most common forms of color blindness.
They are also implemented in ggplot2 via the functions scale_fill_viridis_c()
and scale_colour_viridis_c()
for continuous variables and scale_fill_viridis_d()
and scale_colour_viridis_d()
for discrete variables. These functions can take different values in the
option argument: “magma”, “inferno”, “plasma”, or
“viridis”.
#for continuous data
ggplot(clinical_data, aes(x = Age, y = OS, color = OS)) +
geom_point() +
scale_color_viridis_c(option="magma")
#for discrete data
ggplot(clinical_data, aes(x = Sex, fill = NSCLC_type)) +
geom_bar() +
scale_fill_viridis_d(option="viridis")
Lines
Line types
To change the type and the size of the line, we can use the
linetype and linewidth options:
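A short sketch of both options on a stand-in table (first six rows). The values "dashed" and 1.2 and the name p_line are arbitrary illustrations.

```r
library(ggplot2)

# Stand-in table: Age and OS of the first six patients
clinical_data <- data.frame(
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67),
  OS  = c(0.06666667, 0.2, 0.46666667, 0.63333333, 0.8, 0.83333333)
)

# linetype and linewidth are set outside aes(), so they apply to the whole line
p_line <- ggplot(clinical_data, aes(x = Age, y = OS)) +
  geom_line(linetype = "dashed", linewidth = 1.2)
p_line
```

The linewidth argument requires ggplot2 3.4.0 or later; older versions used size for lines.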
Add Reference lines
We can draw some lines on the plot by specifying their slope and
intercept to geom_abline()
(like in a*x+b mathematical function).
To draw horizontal and vertical lines, use geom_hline()
and geom_vline()
respectively.
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + geom_abline(slope = 0.05, intercept = 2)
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + geom_hline(yintercept = 20, linetype = "dashed")
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + geom_vline(xintercept = 45, color="red")
As previously, we can change the linetype,
linewidth and color of the line.
Points
Transparency
The transparency is managed by the alpha aesthetic. It
can be mapped to a continuous variable or set with a value between 0
(totally transparent) and 1 (totally opaque).
Colors, shape and size
As previously, we can change the shape,
size and color of the points.
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point(alpha = 0.5, shape = 17, color = "darkgreen", size = 3)
Text and labels
Text can be written with geom_text()
or geom_label()
(set a box around the text).
To plot dots and labels:
ggplot(clinical_data, aes(x = Age, y = OS, label = Subject_number)) + geom_point() + geom_text(nudge_x = 3, nudge_y = 1)
nudge_x and nudge_y shift the labels to
the right and up.
We can add labels manually with:
my_labels <- data.frame(my_specific_labels=c("It goes UP!","It goes DOWN!"))
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() + geom_label(
data = my_labels,
aes(label = my_specific_labels),
x = c(35,75), # x coordinates of each label
y = c(25,20), # y coordinates of each label
label.padding = unit(0.55, "lines"), # padding around the label text (leaves more space)
label.size = 0.35, # width of the box border
color = "black", # color text
fill = c("#54a30a", "#b01515") # color box
)
Or to add one text:
ggplot(clinical_data, aes(x = Age, y = OS)) + geom_point() +
annotate("text", # draws the text once, instead of once per row of the data
x = 80, # x coordinates of the text
y = 35, # y coordinates of the text
label = "This is AWESOME!!", # the text
fontface = 3, # italic
size = 8, # size text
color = "#b01515", # color text
angle = 320) # angle text
Bars
Bar position
By default bars are stacked, but we can change their
position to dodge (side by side) or
fill (scaled to 1, i.e. 100%).
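The three positions can be compared with the sketch below, on a stand-in table (first six rows). Object names are illustrative.

```r
library(ggplot2)

# Stand-in table: Sex and vital status of the first six patients
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Death_or_alive = c("alive", "dead", "dead", "alive", "dead", "dead")
)

base_plot <- ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive))

p_stack <- base_plot + geom_bar()                    # default: position = "stack"
p_dodge <- base_plot + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base_plot + geom_bar(position = "fill")   # each bar rescaled to a total of 1
```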
Labeling on bars
As with geom_point(),
we can add a label on each bar.
Here is an example that adds the number of dead or
alive patients counted by ggplot:
ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive)) +
geom_bar(position = "stack") +
geom_text(aes(label = after_stat(count)), stat = "count", position = position_stack(vjust = 0.5))
Note that you have to specify the position of the labels depending on the
position of the bars (“stack”, “dodge” or “fill”) with the position_stack(),
position_dodge()
or position_fill()
functions. The vjust and width options allow you to
place labels more precisely, a little up/down/right/left relative to the
general position given by the position_* functions.
ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive)) +
geom_bar(position = "dodge") +
geom_text(aes(label = after_stat(count)), stat = "count", position = position_dodge(width = 0.9), vjust = 1.5)
ggplot(clinical_data, aes(x = Sex, fill = Death_or_alive)) +
geom_bar(position = "fill") +
geom_text(aes(label = after_stat(count)), stat = "count", position = position_fill(vjust = 0.5))
Themes
Every aspect not related to the data of the plot can be customized.
We can change the background color, the font sizes, the legend position…
The set of characteristics of a plot is called a theme and
can be changed by the theme()
function. Some themes are predefined, but we can customize every single
element by ourselves from scratch or from a predefined theme.
Two predefined themes are useful for publications: theme_bw()
and theme_classic().
Example of changing a theme from a predefined theme:
ggplot(clinical_data, aes(x = Sex, fill = Sex)) +
geom_bar() +
theme_classic() +
theme(
axis.line.x.bottom = element_line(color="blue"),
axis.line.y.left = element_line(color="darkviolet"),
axis.title = element_text(color = "darkorange3", face = "bold", size=20)
)
Example of creating a theme from scratch:
ggplot(clinical_data, aes(x = Sex, fill = Sex)) + geom_bar() + theme(
legend.position = "bottom",
panel.background = element_rect(fill = "white", color = "gray50"),
axis.text = element_text(color = "blue", face = "bold"),
axis.text.x = element_text(angle = 60, hjust = 1)
)
You can also create your own theme, to avoid copying/pasting all the customization on each of your graphs:
# creation of the theme
mytheme <- theme_classic() + # you can start from an existing theme to set up some basic elements
theme(plot.title = element_text(colour = "firebrick3", size = rel(2)),
plot.background = element_rect(fill = "gray70"),
legend.position = "left",
legend.box.background = element_rect(color = "darkblue"),
legend.title = element_text(face = "bold", color = "darkslateblue"),
legend.text = element_text(size = 8, colour = "deeppink2",face = "bold")
)
#apply the theme
ggplot(clinical_data, aes(x = Sex, fill = Sex)) + geom_bar() + labs(title = "My ugly plot!!") + mytheme
There are far too many
theme
elements built into the ggplot2
library to mention here, but you can find a complete list in the theme
documentation: https://ggplot2.tidyverse.org/reference/theme.html
Faceting
A key feature of ggplot2
is its ability to easily produce faceted plots, where each panel
represents a subset of the data.
Faceting on one variable
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
facet_wrap()
tries to fit all the panels in a rectangle, automatically choosing the
number of rows and columns. We can specify the number of rows/columns
with nrow and ncol options.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By default, all panels have the same scales. We can set free scales
for each panel by setting the scales parameter.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Be careful, here plots are difficult to compare between Sex because scales are different now.
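The calls behind these one-variable facets are not displayed, so here is a hedged sketch of what they may look like. Faceting on Sex is inferred from the remark above; the stand-in table (first six rows), bins = 30, ncol = 1 and the object names are assumptions.

```r
library(ggplot2)

# Stand-in table: Sex and Age of the first six patients
clinical_data <- data.frame(
  Sex = c("M", "F", "M", "M", "M", "M"),
  Age = c(87.98, 73.35, 87.26, 85.05, 84.95, 83.67)
)

p_age <- ggplot(clinical_data, aes(x = Age)) + geom_histogram(bins = 30)

p_wrap <- p_age + facet_wrap(~ Sex)                   # one panel per Sex value
p_rows <- p_age + facet_wrap(~ Sex, ncol = 1)         # panels stacked in one column
p_free <- p_age + facet_wrap(~ Sex, scales = "free")  # each panel gets its own scales
```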
Faceting on two variables
Faceting on two variables can be achieved by facet_wrap()
or by facet_grid()
with two different behaviors. Note that facet_wrap()
will drop the non-existing combinations of levels whereas facet_grid()
produces empty panels for them.
ggplot(clinical_data, aes(x = Age, color = Initial_stage)) + geom_histogram() + facet_wrap(Sex ~ Initial_stage, ncol = 4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(clinical_data, aes(x = Age, color = Initial_stage)) + geom_histogram() + facet_grid(Sex ~ Initial_stage)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We have no data for male patients in stage I, so with facet_wrap()
there is no panel for them, whereas facet_grid()
draws an empty panel.
Note that you can facet with more than 2 variables:
ggplot(clinical_data, aes(x = Age, color = Initial_stage)) + geom_histogram() + facet_grid(Sex ~ Initial_stage+NSCLC_type)
Change facet labels
Facet labels can be modified with the labeller argument, which takes a function.
In the following R code, facets are labelled by combining the name of the grouping variable with its levels, using the label_both function.
ggplot(clinical_data, aes(x = Age, color = Initial_stage)) + geom_histogram() + facet_grid(Sex ~ Initial_stage, labeller = label_both)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note that when you facet with more than 2 variables, you should set labeller to label_context:
ggplot(clinical_data, aes(x = Age, color = Initial_stage)) + geom_histogram() + facet_grid(Sex ~ Initial_stage+NSCLC_type, labeller = label_context)
More information about facet labels: https://www.datanovia.com/en/blog/how-to-change-ggplot-facet-labels/
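Besides label_both, the labeller() helper of ggplot2 lets us choose a labelling rule per faceting variable, including a named vector used as a lookup table. The sketch below relies on a made-up data frame, and the "Female"/"Male" labels are assumptions for illustration.

```r
library(ggplot2)

# Made-up data with the same column names as the course table
toy <- data.frame(
  Age = c(62, 71, 58, 80, 66, 74, 69, 77),
  Sex = rep(c("F", "M"), each = 4),
  Initial_stage = rep(c("III", "IV"), times = 4)
)

# Lookup table: names are the levels found in the data, values are the labels to display
sex_labels <- c(F = "Female", M = "Male")

p_lab <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  facet_grid(
    Sex ~ Initial_stage,
    labeller = labeller(Sex = sex_labels, Initial_stage = label_both)
  )
p_lab
```

Here the rows are relabelled through the lookup table, and the columns keep the "variable: value" form produced by label_both.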
Combine plots with patchwork
So far, we’ve used facets to split our chart into multiple viewports. However, this is limited to plotting the same variables from the same dataset.
The patchwork package (installed from CRAN with install.packages("patchwork"), or from GitHub with devtools::install_github("thomasp85/patchwork")) makes it easy to arrange separate ggplots in the same frame with + (arrange the graphs next to each other), / (arrange one graph on top of the other) and () (group an arrangement of graphs), as if you were writing an equation.
library(patchwork)
p1 <- ggplot(clinical_data, aes(x = Age)) + geom_histogram() + labs(title = "Plot1")
p2 <- ggplot(clinical_data, aes(x = Sex)) + geom_bar() + labs(title = "Plot2")
p3 <- ggplot(clinical_data, aes(x = Initial_stage)) + geom_bar() + labs(title = "Plot3")
p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
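The other two operators work the same way. Here is a minimal sketch on a made-up data frame (the names q1, q2 and q3 are chosen so as not to overwrite p1, p2 and p3):

```r
library(ggplot2)
library(patchwork)

# Made-up data with the same column names as the course table
toy <- data.frame(
  Age = c(62, 71, 58, 80, 66, 74, 69, 77),
  Sex = c("F", "M", "M", "F", "M", "M", "F", "M"),
  Initial_stage = c("II", "IV", "IV", "III", "IV", "II", "IV", "III")
)

q1 <- ggplot(toy, aes(x = Age)) + geom_histogram(binwidth = 5) + labs(title = "Age")
q2 <- ggplot(toy, aes(x = Sex)) + geom_bar() + labs(title = "Sex")
q3 <- ggplot(toy, aes(x = Initial_stage)) + geom_bar() + labs(title = "Stage")

stacked <- q1 / q2        # q1 on top of q2
nested  <- (q1 + q2) / q3 # q1 and q2 side by side, q3 below them
nested
```

The parentheses matter: without them, R applies / before +, so q1 + q2 / q3 would place q1 next to a column made of q2 and q3.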
It is often necessary to add general titles, captions, tags, etc. to
a composition. This can be achieved by adding a plot_annotation()
to the patchwork:
p1 + p2 + plot_annotation(title = 'Distributions of clinical data',
subtitle = 'These 2 plots will reveal yet-untold secrets about our beloved data-set',
caption = 'Disclaimer: None of these plots are insightful',
tag_levels = 'A', tag_prefix = 'Fig. ',
theme = theme(plot.title = element_text(color = "red"))
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To apply a theme to every plot of the combination, we can use the theme() function as usual, but added with the & operator instead of +:
p1 + p2 + plot_annotation(title = 'Distributions of clinical data',
subtitle = 'These 2 plots will reveal yet-untold secrets about our beloved data-set',
caption = 'Disclaimer: None of these plots are insightful',
tag_levels = 'A', tag_prefix = 'Fig. ',
theme = theme(plot.title = element_text(color = "red"))
) & theme(plot.tag = element_text(color = "orange"),
plot.title = element_text(colour = "red"),
axis.line.x.bottom = element_line(color="blue"),
axis.line.y.left = element_line(color="darkviolet")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here we presented the most important features, but patchwork allows further customization: giving more space to some plots than to others, displaying one plot superimposed on another (for example in the corner of another plot), merging legends, etc.
See the official documentation of patchwork for more information: https://patchwork.data-imaginist.com/index.html
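As an illustration of these extra features, the sketch below uses plot_layout() to give more width to one plot and to merge the legends, and inset_element() to place one plot in the corner of another. The data frame is made up, and the positions of the inset (fractions of the host plot area) are arbitrary choices.

```r
library(ggplot2)
library(patchwork)

# Made-up data with the same column names as the course table
toy <- data.frame(
  Age = c(62, 71, 58, 80, 66, 74, 69, 77),
  Sex = c("F", "M", "M", "F", "M", "M", "F", "M")
)

q1 <- ggplot(toy, aes(x = Age, fill = Sex)) + geom_histogram(binwidth = 5)
q2 <- ggplot(toy, aes(x = Sex, fill = Sex)) + geom_bar()

# First plot twice as wide as the second, identical legends merged into one
wide_left <- q1 + q2 + plot_layout(widths = c(2, 1), guides = "collect")

# q2 (without its legend) drawn in the top-right corner of q1
with_inset <- q1 +
  inset_element(q2 + theme(legend.position = "none"),
                left = 0.6, bottom = 0.6, right = 1, top = 1)

wide_left
with_inset
```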
Save graphs
In RStudio, there are many ways to save your figures.
You could copy them to the clipboard, but it is preferable to export them as a file of the type of your choice (png, jpg, tiff, pdf, etc.).
You can do it through the RStudio interface, with the Export button in the menu of the Plots panel (lower right panel), or from the command line.
ggsave() is a useful command that saves a plot directly to your working directory (or to the path you give in the filename) and allows you to specify the name of the new file, the dimensions of the plot (with the width and height arguments), the resolution (with the dpi argument), etc.
ggsave(filename = "myplot_p1_with_ggsave.png", plot = p1, width = 3, height = 3)
ggsave(filename = "myplot_p1_with_ggsave.pdf", plot = p1, width = 10, height = 3)
ggsave() saves only one plot per call, so when you produce many graphs (combined by patchwork or not), you should use jpeg(), png(), tiff(), bmp() or pdf() as follows:
- Open the file that will receive your image with one of the functions above. Additional arguments giving the width and the height of the image can also be used.
- Create the plot (or print it if it is already created).
- Close the file with dev.off(). Note that we need to call this function after all the plotting, to save the file and return control to the screen.
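The three steps above can be sketched as follows. The data frame, the plot names and the file name are made up, and the file is written to a temporary directory so that the example does not clutter the working directory.

```r
library(ggplot2)

# Made-up data with the same column names as the course table
toy <- data.frame(
  Age = c(62, 71, 58, 80, 66, 74),
  Sex = c("F", "M", "M", "F", "M", "M")
)
p_age <- ggplot(toy, aes(x = Age)) + geom_histogram(binwidth = 5)
p_sex <- ggplot(toy, aes(x = Sex)) + geom_bar()

out_file <- file.path(tempdir(), "two_plots_demo.pdf")

# 1. Open the file (for pdf(), width and height are in inches)
pdf(file = out_file, width = 8, height = 4)
# 2. Draw the plots: with pdf(), each plot goes on its own page
print(p_age)
print(p_sex)
# 3. Close the file
dev.off()

file.exists(out_file)
```

Calling print() explicitly is a safe habit: inside a loop or a function, a ggplot object is not drawn automatically.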
Final Exercises
The exercises use the same dataset as the rest of the course, and need only ggplot2, patchwork and the saving functions.
Remember, draw your graph step by step.
Question 1A :
Redo this plot:
Question 1B :
Modify the code of question 1A to change the main title and the axis names and to flip the graph, so as to draw this plot:
Answer
ggplot(clinical_data, aes(x = NSCLC_type , fill = Initial_stage)) +
geom_bar() +
coord_flip() +
labs(title = "Distribution of the NSCLC type by Initial stage", x = "", y = "Count")
Question 1C :
Adapt the code of question 1B to change the background color, draw the axes, sort “NSCLC_type” in alphabetical order and change the bar colors, so as to draw this plot:
Tip
To get the alphabetical order, you can use the unique() function (?unique) to get the unique values of a vector (here the “NSCLC_type” values), then the sort() function with its decreasing argument (?sort).
Answer
# get NSCLC type in alphabetical order
NSCLC_ordered <- sort(unique(clinical_data$NSCLC_type), decreasing = TRUE)
# draw plot
ggplot(clinical_data, aes(x = NSCLC_type , fill = Initial_stage)) +
geom_bar() +
coord_flip() +
labs(title = "Distribution of the NSCLC type by Initial stage", x = "", y = "Count") +
theme_classic() +
scale_x_discrete(limits = NSCLC_ordered) +
scale_fill_manual(values = c("I" = "#f39c12", "II" = "#117a65", "III" = "#2980b9", "IV" = "#884ea0"))
Question 2A :
Redo this plot:
Answer
ggplot(clinical_data, aes(x = Age, y = OS, color = Initial_stage, shape = Death_or_alive)) + geom_point()
Question 2B :
Adapt the code of question 2A to change the main title, the shapes, the size of the points, the background color and the “Initial_stage” colors, and to draw the axes, so as to obtain this plot:
Colors don’t matter (just change them), but if you want the same as mine: #f39c12, #117a65, #2980b9 and #884ea0. They are the same as those of question 1C.
Shapes don’t matter either, but I use shape 4 and shape 16.
Answer
ggplot(clinical_data, aes(x = Age, y = OS, color = Initial_stage, shape = Death_or_alive)) +
geom_point(size = 3) +
labs(title = "Repartition of the Age according to the OS") +
theme_classic() +
scale_shape_manual(values=c(16, 4)) +
scale_color_manual(values = c("I" = "#f39c12", "II" = "#117a65", "III" = "#2980b9", "IV" = "#884ea0"))
Question 2C :
Adapt the code of question 2B to add dashed vertical and horizontal lines, together with the x and y values at which they cross the axes, so as to draw this plot:
The additional lines mark the point with the maximum “OS”.
Tips
To find the positions of the additional lines and of the text next to them, you can compute them before drawing the plot. You have to find the maximum value of “OS”, so check the max() function. Then, to find the “Age” of the patient with the maximum “OS”, you should filter the data.frame.
Answer for the “Age” of the maximum value of “OS”
Answer
max_OS <- max(clinical_data$OS)
Age_with_max_OS <- clinical_data[clinical_data$OS == max_OS,"Age"]
ggplot(clinical_data, aes(x = Age, y = OS, color = Initial_stage, shape = Death_or_alive)) +
geom_point(size = 3) +
labs(title = "Repartition of the Age according to the OS") +
theme_classic() +
scale_shape_manual(values=c(16, 4)) +
scale_color_manual(values = c("I" = "#f39c12", "II" = "#117a65", "III" = "#2980b9", "IV" = "#884ea0")) +
geom_hline(yintercept = max_OS, linetype="dotdash", color="red") +
geom_vline(xintercept = Age_with_max_OS, linetype="dotdash", color="red") +
geom_text(data = NULL, x = 31, y = max_OS + 1.2, label = max_OS, size = 3, color = "red") +
geom_text(data = NULL, x = Age_with_max_OS + 1 , y = 0.5, label = Age_with_max_OS, size = 3, color = "red", angle=90)
Question 3A :
Draw the distribution of the “Mutation_Type”, classified by “Previous_immunotherapy” and by “Previous_radiotherapy”, as represented in this plot:
Here “OS” is treated as a continuous variable.
Answer
ggplot(clinical_data, aes(x = OS, fill = Mutation_Type)) + geom_histogram() + facet_grid(Previous_immunotherapy ~ Previous_radiotherapy)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question 3B :
Adapt the code of question 3A to add a main title and facet titles (with only “yes” or “no”, we cannot tell which variable a panel refers to) and to change the theme:
Answer
ggplot(clinical_data, aes(x = OS, fill = Mutation_Type)) +
geom_histogram() +
facet_grid(Previous_immunotherapy ~ Previous_radiotherapy, labeller = label_both) +
labs(title = "Distribution of mutated genes with previous immunotherapy and/or radiotherapy") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question 4 :
Combine the plots from question 1C, question 2C and question 3B with patchwork, add a main title and tags, and save the result in PDF format.
Answer
# Question 1C
pQ1C <- ggplot(clinical_data, aes(x = NSCLC_type , fill = Initial_stage)) +
geom_bar() +
coord_flip() +
labs(title = "Distribution of the NSCLC type by Initial stage", x = "", y = "Count") +
theme_classic() +
scale_x_discrete(limits = NSCLC_ordered) +
scale_fill_manual(values = c("I" = "#f39c12", "II" = "#117a65", "III" = "#2980b9", "IV" = "#884ea0"))
# Question 2C
pQ2C <- ggplot(clinical_data, aes(x = Age, y = OS, color = Initial_stage, shape = Death_or_alive)) +
geom_point(size = 3) +
labs(title = "Repartition of the Age according to the OS") +
theme_classic() +
scale_shape_manual(values=c(16, 4)) +
scale_color_manual(values = c("I" = "#f39c12", "II" = "#117a65", "III" = "#2980b9", "IV" = "#884ea0")) +
geom_hline(yintercept = max_OS, linetype="dotdash", color="red") +
geom_vline(xintercept = Age_with_max_OS, linetype="dotdash", color="red") +
geom_text(data = NULL, x = 31, y = max_OS + 1.2, label = max_OS, size = 3, color = "red") +
geom_text(data = NULL, x = Age_with_max_OS + 1 , y = 0.5, label = Age_with_max_OS, size = 3, color = "red", angle=90)
# Question 3B
pQ3B <- ggplot(clinical_data, aes(x = OS, fill = Mutation_Type)) +
geom_histogram() +
facet_grid(Previous_immunotherapy ~ Previous_radiotherapy, labeller = label_both) +
labs(title = "Distribution of mutated genes with previous immunotherapy and/or radiotherapy")+
theme_bw()
# Combination and save
pdf(file ="my_final_combined_plots.pdf", width = 25, height = 6)
(pQ1C + pQ2C + pQ3B) + plot_annotation(title = 'Summary of the study',
tag_levels = 'A',
tag_prefix = 'Fig. ',
theme = theme(plot.title = element_text(size = 15)))
dev.off()
Extensions and resources
Packages extending ggplot2:
- ggrepel: smart geom_text and geom_label placement
- cowplot: several plots on one page
- patchwork: arrange plots together
- ggraph: visualise graphs
- ggmap: plot data on a map
- factoextra: visualise results of factorial analysis (PCA, CA, MCA, MFA, …)
- ggbio: visualise genomic data
- ggdendro: visualise dendrograms and trees
- ggthemes: additional themes
- ggsci: palettes inspired by scientific journals, science fiction and TV shows
- ggedit: manually edit ggplot objects (change theme settings…)
- gganimate: create animated plots
- ggforce: new functionalities
- survminer: draw survival curves
- ggridges: ridgeline plots
- ggiraph: make ggplot graphics interactive
Resources:
- http://ggplot2.tidyverse.org
- RStudio cheatsheet https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf
- ggplot2 book https://ggplot2-book.org/
- Theoretical aspects of ggplot2 (Layered Grammar of Graphics) http://www.stat.columbia.edu/~gelman/bayescomputation/Wickham2010.pdf
- General advice on data visualisation https://serialmentor.com/dataviz/